A Taxonomically-informed Mass Spectrometry Search Tool for Microbial Metabolomics Data

Abstract
MicrobeMASST, a taxonomically-informed mass spectrometry (MS) search tool, tackles limited microbial metabolite annotation in untargeted metabolomics experiments. Leveraging a curated database of >60,000 microbial monocultures, users can search known and unknown MS/MS spectra and link them to their respective microbial producers via MS/MS fragmentation patterns. Identification of microbial-derived metabolites and relative producers, without a priori knowledge, will vastly enhance the understanding of microorganisms' role in ecology and human health.

Main
Microorganisms drive the global carbon cycle 1 and can establish symbiotic relationships with host organisms, influencing their health, aging, and behavior 2-6. Microbial populations interact with different ecosystems through the alteration of available metabolite pools and the production of specialized small molecules 7,8. The vast genetic potential of these communities is exemplified by human-associated microorganisms, which encode approximately 100 times more genes than the human genome 9,10. However, this metabolic potential is not reflected in modern untargeted metabolomics experiments, where typically <1% of the annotated molecules can be classified as microbial. This problem particularly affects mass spectrometry (MS)-based untargeted metabolomics, a common technique to investigate molecules produced or modified by microorganisms 11, which famously struggles with spectral annotation of complex biological samples. This is because the majority of spectral reference libraries are biased towards commercially available or otherwise accessible standards of primary metabolites, drugs, or industrial chemicals. Even when metabolites are annotated, extensive literature searches are required to understand whether these molecules have microbial origins and to identify the respective microbial producers. Public databases, such as KEGG 12, MiMeDB 13, NPAtlas 14, and LOTUS 15, can assist in this interpretation, but they are mostly limited to well-established, largely genome-inferred, metabolic models or to fully characterized and published molecular structures. Additionally, while targeted metabolomics efforts aimed at interrogating the gut microbiome mechanistically have been developed 16, these focus on only a relatively small number of commercially available microbial molecules. Hence, the majority of the microbial chemical space remains unknown, despite the continuous expansion of MS reference libraries.
To fill this gap, we have developed microbeMASST (https://masst.gnps2.org/microbemasst/), a search tool that leverages public MS repository data to identify the microbial origin of known and unknown metabolites and map them to their microbial producers.
MicrobeMASST is a community-sourced tool that works within the GNPS 17 ecosystem. Users can search tandem MS (MS/MS) spectra obtained from their experiments against MS/MS spectra previously detected in extracts of bacterial, fungal, or archaeal monocultures. No other available resource or tool allows linking uncharacterized MS/MS spectra to characterized microorganisms. The microbeMASST reference database of monocultures has been generated through years of community contributions and metadata curation, and it contains microorganisms isolated from plants, soils, oceans, lakes, fish, terrestrial animals, and humans (Figure 1a). All available microorganisms are categorized according to the NCBI taxonomy 18 at different taxonomic resolutions (e.g., species, genus, family) or mapped to the closest taxonomically accurate level if no NCBI ID was available at the time of database creation. As of June 2023, microbeMASST includes 60,781 liquid chromatography (LC)-MS/MS files, comprising >100 million MS/MS spectra, mapped to 541 strains, 1,336 species, 539 genera, 264 families, 109 orders, 41 classes, and 16 phyla from the three domains of life: Bacteria, Archaea, and Eukaryota (Figure 1b). Unlike MASST 19, which uses a precomputed network of ~110 million MS/MS spectra to enable spectral searching, microbeMASST is based on the newly introduced Fast Search Tool (https://fasst.gnps2.org/fastsearch/) 20. This tool, originally designed for proteomics, improves search speed by several orders of magnitude by indexing all the MS/MS spectra present in GNPS/MassIVE and restricting the search space to the user input parameters. Because of this, search results are returned within seconds, as opposed to 20 min per search or 24-48 hours for modification-tolerant searches in the original implementation of MASST. Additionally, microbeMASST leverages the pre-curated file-associated metadata to aggregate results into taxonomic trees.
This represents a major enhancement over MASST, where users have to manually inspect result tables and contextualize them, making interpretations tedious.
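The idea of indexing spectra once and then restricting the search space to the user input parameters can be illustrated with a minimal sketch. It is not the Fast Search Tool's actual implementation: the record fields (`id`, `precursor_mz`), the function names, and the default tolerance are assumptions chosen for illustration, and only the precursor m/z filter is shown.

```python
from bisect import bisect_left, bisect_right

def build_precursor_index(spectra):
    """Sort spectra by precursor m/z once so that later lookups are fast.

    `spectra` is a list of dicts with hypothetical keys "id" and "precursor_mz".
    """
    ordered = sorted(spectra, key=lambda s: s["precursor_mz"])
    mzs = [s["precursor_mz"] for s in ordered]
    ids = [s["id"] for s in ordered]
    return mzs, ids

def candidates(index, query_mz, tol=0.02):
    """Return ids of spectra whose precursor m/z lies within +/- tol of query_mz."""
    mzs, ids = index
    lo = bisect_left(mzs, query_mz - tol)   # first entry >= lower bound
    hi = bisect_right(mzs, query_mz + tol)  # one past last entry <= upper bound
    return ids[lo:hi]

# Toy data, not taken from the microbeMASST reference database.
toy_spectra = [
    {"id": "a", "precursor_mz": 405.26},
    {"id": "b", "precursor_mz": 405.27},
    {"id": "c", "precursor_mz": 314.10},
    {"id": "d", "precursor_mz": 500.00},
]
toy_index = build_precursor_index(toy_spectra)
hits = candidates(toy_index, 405.265)
```

Only the spectra returned by `candidates` would then need a full fragment-level comparison, which is why a precomputed index keeps each query cheap regardless of how many spectra are stored.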
In microbeMASST, users can search MS/MS spectra using a Universal Spectrum Identifier (USI) 21 or by inputting a precursor ion mass and its spectral fragmentation pattern (Supplementary Figure 1). Analogue search can also be enabled to discover molecules related to the MS/MS spectrum of interest across the taxonomic tree 17,19,22. The microbeMASST web app displays query results in interactive taxonomic trees, which can be downloaded as HTML files. Nodes in the trees represent specific taxa and display rich information, such as taxon scientific name, NCBI taxonomic ID, number of deposited samples, number of MS/MS matches found, and proportion of matches found, which is also visualized through pie charts. Information for an MS/MS match in a particular taxon is propagated upstream through its lineage. The reactive interface of microbeMASST enables filtering of the tree to specific taxonomic levels or to a minimum number of matches observed per taxon. Additionally, three data tables are generated, linking the search job to other resources in the GNPS/MassIVE ecosystem. Each MS/MS query is searched against the public MS/MS reference library of GNPS (587,213 MS/MS spectra, June 2023). Annotations to such reference compounds are listed under the 'Library matches' tab (Supplementary Figure 2a). The 'Datasets matches' tab contains information on the matching scans, displaying scientific name, NCBI taxonomic ID and taxonomic rank, number of matching fragment ions, and modified cosine score, together with a link to a mirror plot visualization (Supplementary Figure 2b). Finally, the 'Taxa matches' tab reports how many matches were found per taxon and the number of samples available for that taxon (Supplementary Figure 2c). Quality controls (QCs) and blank samples (n=2,902) present in the reference datasets of microbeMASST have been retained to provide information on possible contaminants and media components.
Additionally, data from human cell line cultures (n=1,199) have been included to enable assessment of whether molecules can be produced by both human hosts and microorganisms.
Figure 1 (caption, panels c and d). c) Examples of medically relevant small molecules known to be produced by bacteria or fungi: lovastatin, a cholesterol-lowering drug originally isolated from the genus Aspergillus 24; salinosporamide A, a Phase III candidate to treat glioblastoma produced by Salinispora tropica 25; and commendamide, a human G-protein-coupled receptor agonist 26. d) MicrobeMASST search outputs of the three different molecules of interest confirm that they were exclusively found in monocultures of the only known producers. Pie charts display the proportion of MS/MS matches found in the deposited reference database. Blue indicates a match with a monoculture, while yellow represents a nonmatch. Searches were performed using MS/MS spectra deposited in the GNPS reference library: lovastatin (CCMSLIB00005435737), salinosporamide A (CCMSLIB00010013003), and commendamide (CCMSLIB00004679239).
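The upstream propagation of matches through a lineage, and the per-node proportions shown in the pie charts, can be sketched as follows. This is a simplified illustration rather than the microbeMASST code: the child-to-parent map is a toy slice of a taxonomy with intermediate ranks omitted, and the sample counts are invented.

```python
from collections import defaultdict

# Hypothetical child -> parent map for a tiny, simplified slice of a taxonomy.
PARENT = {
    "Aspergillus terreus": "Aspergillus",
    "Aspergillus niger": "Aspergillus",
    "Aspergillus": "Fungi",
    "Penicillium citrinum": "Penicillium",
    "Penicillium": "Fungi",
}

def lineage(taxon):
    """Yield the taxon itself, then each ancestor up to the root."""
    while taxon is not None:
        yield taxon
        taxon = PARENT.get(taxon)

def propagate(per_taxon):
    """Sum (n_samples, n_matched_samples) over each node's whole subtree.

    `per_taxon` maps a leaf taxon to its own (n_samples, n_matched_samples).
    """
    totals = defaultdict(lambda: [0, 0])
    for taxon, (n_samples, n_matched) in per_taxon.items():
        for node in lineage(taxon):
            totals[node][0] += n_samples
            totals[node][1] += n_matched
    return {node: tuple(counts) for node, counts in totals.items()}

# Invented counts: matches found only in one Aspergillus species.
leaf_counts = {
    "Aspergillus terreus": (10, 4),
    "Aspergillus niger": (5, 0),
    "Penicillium citrinum": (8, 0),
}
tree = propagate(leaf_counts)
# Proportion of matched samples per node, as a pie chart would display it.
proportions = {node: matched / total for node, (total, matched) in tree.items()}
```

Filtering the tree to a minimum number of matches per taxon then reduces to keeping the nodes whose second count meets the chosen threshold.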
Search results for lovastatin, salinosporamide A, and commendamide MS/MS spectra highlight how microbeMASST can correctly connect microbial molecules to their known producers (Figure 1c). In the case of lovastatin, a clinically used cholesterol-lowering drug originally isolated from Aspergillus terreus 24, spectral matches were unique to the genus Aspergillus (Figure 1d). The MS/MS spectrum for salinosporamide A, a Phase III candidate to treat glioblastoma 27, only matched two strains of Salinispora tropica (Figure 1d), the only known producer 25. Commendamide, first observed in cultures of Bacteroides vulgatus (recently reclassified as Phocaeicola vulgatus), is a G-protein-coupled receptor agonist 26. Surprisingly, it had many matches across several bacterial cultures, including Flavobacteriaceae (Algibacter, Lutibacter, Maribacter, Polaribacter, Postechiella, and Winogradskyella) and Bacteroides cultures (Figure 1d). Additional examples include searches of mevastatin, arylomycin A4, yersiniabactin, promicroferrioxamine, and the microbial bile acid conjugates 28-30 glutamate-cholic acid (Glu-CA) and glutamate-deoxycholic acid (Glu-DCA) (Supplementary Figure 3). Mevastatin, another cholesterol-lowering drug originally isolated from Penicillium citrinum 31, was only found in samples classified as fungi. The antibiotic arylomycin A4, originally isolated from Streptomyces sp. Tue 6075 in 2002 32, was observed in different Streptomyces species. Yersiniabactin, a siderophore originally isolated from Yersinia pestis 33, whose monoculture is not yet present in the reference database of microbeMASST, was observed in Escherichia coli and Klebsiella species, consistent with previous observations 34,35. Promicroferrioxamine, another siderophore, matched Micromonospora chokoriensis and Streptomyces species. This molecule was originally isolated from an uncharacterized Promicromonosporaceae isolate 36.
The MS/MS spectrum of the gut microbiota-derived Glu-CA, an amidated tri-hydroxylated bile acid, was most frequently observed in cultures of Bifidobacterium species, while Glu-DCA was found only in one Bifidobacterium strain but also in two Enterococcus and Clostridium species. None of the aforementioned molecules were found in cultured human cell lines, highlighting the ability of microbeMASST to distinguish MS/MS spectra of molecules that can be exclusively produced by either bacteria or fungi. It is important to acknowledge that MS/MS data generally do not differentiate stereoisomers, but they can nevertheless provide crucial information on molecular families.
MicrobeMASST can also be used to extract microbial information from mass spectrometry-based metabolomics studies without any a priori knowledge. To illustrate this, we reprocessed an untargeted metabolomics study comparing germ-free (GF) mice to those harboring microbial communities, also known as specific pathogen-free (SPF) mice 29 (Figure 2a). We extracted 10,047 consensus MS/MS spectra uniquely present in SPF mice and queried them with microbeMASST. A total of 3,262 MS/MS spectra were found to have a microbial match. Of these, 837 were also found in human cell lines and for this reason were removed from further analysis. Among the remaining 2,425 MS/MS spectra, 1,673 were exclusively found in bacteria, 95 in fungi, and 657 in both (Supplementary Figure 4). These MS/MS spectra were then processed with SIRIUS 37 and CANOPUS 38 to tentatively annotate the metabolites and identify their chemical classes. A file containing all these spectra of interest can be explored and downloaded in .mgf format from GNPS (see Methods). To further validate the microbial origin of these MS/MS spectra, we assessed their overlap with data acquired from a different study comparing SPF mice treated with a cocktail of antibiotics to untreated controls 40. Interestingly, 621 MS/MS spectra were also found in this second dataset and 512 were only present in animals not treated with antibiotics (Figure 2b). The distribution of these spectra and their classes across bacterial phyla was visualized using an UpSet plot 39 (Figure 2c). Notably, the majority of the spectra classified as terpenoids were commonly observed across phyla, while amino acids and peptides appeared to be more phylum specific. Of these 512 spectra, 23% had a level 2 or 3 annotation 41, matching against the GNPS reference libraries (Supplementary Table 1).
These included the recently described amidated microbial bile acids 19,28-30,42-47, free bile acids originating from the hydrolysis of host-derived taurine bile acid conjugates 48, keto bile acids formed via microbial oxidation of alcohols 29, N-acyl lipids belonging to a similar class of metabolites as commendamide 26 (a microbial N-acyl lipid), di- and tripeptides seen in microbial digestion of proteins 49, and soyasapogenol, a byproduct of the microbial digestion of complex saccharides from dietary soyasaponins 29. Part of the remaining unannotated spectra can be identified as chemical modifications of the above annotated microbial metabolites through spectral similarity obtained from molecular networking (Supplementary Figure 5). Based on literature information, the list of annotated MS/MS spectra contained a small number of metabolites traditionally considered to be non-microbial in origin. One interpretation of this finding is that microorganisms are capable of producing metabolites previously described to only be made by mammalian hosts. Notable examples include serotonin 50, γ-aminobutyric acid (GABA) 51, and glycocholic acid 42,52-54, with microorganisms often being the primary producers of these metabolites in the gut. An alternative hypothesis is that microorganisms can selectively stimulate the production of host metabolites. Other limitations regarding annotations are discussed in Methods.
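The successive filters applied to the SPF-specific spectra (keep spectra with a microbial match, discard those also seen in human cell lines, then split by where they were observed) amount to simple set-style operations. A minimal sketch follows; the record fields and the toy records are hypothetical and do not reproduce the study's files, while the final two checks use only the counts reported in the text.

```python
def filter_microbial(spectra):
    """Group spectra that match microbes but not human cell lines.

    Each record is a dict with hypothetical boolean keys "bacteria",
    "fungi", and "human_cell_line", plus an "id".
    """
    groups = {"bacteria_only": [], "fungi_only": [], "both": []}
    for s in spectra:
        has_microbial_match = s["bacteria"] or s["fungi"]
        if not has_microbial_match or s["human_cell_line"]:
            continue  # no microbial match, or also produced by human cells
        if s["bacteria"] and s["fungi"]:
            groups["both"].append(s["id"])
        elif s["bacteria"]:
            groups["bacteria_only"].append(s["id"])
        else:
            groups["fungi_only"].append(s["id"])
    return groups

# Toy records for illustration only.
toy = [
    {"id": 1, "bacteria": True, "fungi": False, "human_cell_line": False},
    {"id": 2, "bacteria": True, "fungi": True, "human_cell_line": False},
    {"id": 3, "bacteria": False, "fungi": True, "human_cell_line": False},
    {"id": 4, "bacteria": True, "fungi": False, "human_cell_line": True},
    {"id": 5, "bacteria": False, "fungi": False, "human_cell_line": False},
]
groups = filter_microbial(toy)

# Consistency of the counts reported in the text.
remaining = 3262 - 837            # microbial matches minus human cell line matches
by_group = 1673 + 95 + 657        # bacteria only + fungi only + both
```

Both `remaining` and `by_group` evaluate to 2,425, in agreement with the number of spectra carried forward.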
To assess whether the observations from the mouse models translate to humans, we searched public human data and found that 455 of the 512 MS/MS spectra of interest had matches (Figure 2d). Interestingly, these spectra were found in both healthy individuals and individuals affected by different health conditions, including type II diabetes, inflammatory bowel disease (IBD), and Alzheimer's disease. These spectra were most commonly found in stool samples (n=110,973 MS/MS matches), followed by blood, breast milk, and the oral cavity, as well as other organs, including the brain, skin, and vagina, and biofluids such as cerebrospinal fluid and urine (Figure 2e). These findings support the concept that a significant number of microbial metabolites reach and influence distant organs in the human body 55.
We anticipate microbeMASST will be a key resource to enhance understanding of the role of microbial metabolites across a wide range of ecosystems, including oceans, plants, soils, insects, animals, and humans. This expanding resource will enable the scientific community to gain valuable taxonomic and functional insights into diverse microbial populations. The mass spectrometry community will play a key role in the evolution of this tool in the future through the continued deposition of data associated with novel microbial monocultures and the expansion of spectral reference libraries. Moreover, microbeMASST holds potential for various applications, ranging from aquaculture and agriculture to biotechnology and the study of microbial-mediated human health conditions. By harnessing the power of public data, we can unlock new opportunities for advancements in multiple fields and deepen our understanding of the intricate relationships between microorganisms and their ecosystems.

Data and code availability
Data used to generate the reference database of microbeMASST are publicly available at GNPS/MassIVE (https://massive.ucsd.edu/). A list with all the accession numbers (MassIVE IDs) of the studies used to generate this tool is available in Supplementary Table 2. Interactive examples of the MS/MS queries illustrated in Figure 1d and Supplementary Figure 3 can be generated, visualized, and downloaded from the microbeMASST website (https://masst.gnps2.org/microbemasst/). Known molecules already present in the GNPS library (https://library.gnps2.org/) were used to facilitate interpretation and confirm that specific bacterial and fungal molecules were exclusively observed in the respective monocultures.

Data collection and harmonization
Data deposited in GNPS/MassIVE were investigated manually and systematically, using ReDU 23 (https://redu.ucsd.edu/), to extract all the publicly available MS/MS files (.mzML or .mzXML formats) acquired from monocultures of bacteria, fungi, archaea, and human cell lines. Only monocultures were included in this search tool to unequivocally associate the production of the detected metabolites with each specific taxon. A total of 60,781 files from 537 different GNPS/MassIVE datasets were selected to be used as the reference database of microbeMASST (Supplementary Table 2). These comprise files deposited in response to our call to the scientific community. Between May and July 2022, 25 different research groups deposited 65 distinct datasets in GNPS/MassIVE, comprising a total of 3,142 unique LC-MS/MS files. This represented a 5.45% increase in publicly available MS/MS data acquired from monocultures in just two months. To qualify as a contributor and be credited as one of the authors, researchers had to deposit high-resolution LC-MS/MS data acquired in either positive or negative ionization mode from monocultures of bacteria, fungi, or archaea. Harmonization of the acquired data and metadata represented a challenge. The NCBI taxonomic database is constantly expanding and evolving, and the latest ReDU update (December 2021) does not accommodate the most recently deposited taxa. For this reason, an additional metadata file (microbeMASST_metadata_massiveID) was generated specifically for the microbeMASST project and uploaded to the respective GNPS/MassIVE datasets deposited by the collaborators whenever the ReDU workflow failed. All the collected information was finally